Merge latest changes from main to 'Documentation' branch #192

rsareddy0329 · 2025-08-05T23:05:11Z

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Use git-secrets (https://github.com/awslabs/git-secrets) to ensure that the pull request (PR) does not expose passwords or other sensitive information, and upload the relevant evidence.
- Ensure that each commit has a GitHub commit signature.
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify whether there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, resolve them.

For Reviewer

Go through the For Requester section to double-check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

Co-authored-by: adishaa <[email protected]>

… with minor improvements and bug fixes (#137)

… with minor improvements and bug fixes. (#139)

…and ux (#136)

…ception count data (#140)

* manual release v3.0.1

…alarm fix (#147)

… regionalized HMA URI (#141)

* Add unique time string to integ test * Update syntax

* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>

* Update inference SDK examples * Update readme

* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>

…102)

* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed

Co-authored-by: pintaoz <[email protected]>

* Update inference config and integ tests * Update integ tests for new canaries

* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <[email protected]>

* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally

…8s (#138)
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.
  **Description**
  - Removed an integration test case that was being mocked.
  - Moved a regex to a constant.
  **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints
  **Description** Added a link to the k8s documentation describing the constraints that govern version compatibility between client and server.
  **Testing Done** No breaking changes.
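For readers unfamiliar with this kind of check, the following is a minimal sketch of a client/server version validation. It is not the code from #138. It assumes the constraint in question is the published kubectl skew policy (client within one minor version of the API server), and the names `VERSION_PATTERN`, `parse_major_minor`, and `is_compatible` are hypothetical.

```python
import re

# Hypothetical constant (the commit mentions moving a regex to a constant;
# the actual pattern used in the repository is not shown in this PR).
# Matches strings such as "v1.28.3", "1.30", or "v1.29.0-eks-abc123".
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str,
                  max_minor_skew: int = 1) -> bool:
    """Return True if client and server are within the allowed minor skew.

    Assumption: the majors must match and the minors may differ by at most
    `max_minor_skew` in either direction.
    """
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= max_minor_skew
```

With these assumptions, `is_compatible("v1.28.3", "v1.29.0-eks-abc123")` is accepted, while a client two minor versions behind the server is rejected.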

* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries

…189) Co-authored-by: pintaoz <[email protected]>

* Update documentation-with-new-changes branch with latest changes from main (#190) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Documentation Fixes (#191) Co-authored-by: Roja Reddy Sareddy <[email protected]> * update documentation with new changes branch with latest changes (#194) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Documentation Fixes (#195) * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation Fixes (#197) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation Fixes (#198) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation fixes (#199) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]>

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <[email protected]>

* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin

…holder value (#206) Co-authored-by: Mohamed Zeidan <[email protected]>

* Update PR template * Update template * Update template format * Update format * Fix readme

* delete cluster stack * delete cluster stack * removed unnecessary file * unit tests * more modular code * refactored modular code * sdk code added and improved modularity * cleanup * removed silent failure for sdk * fixed unit tests * integ tests * 2 integ happycase tests * changed test to use iam role instead of s3 bucket --------- Co-authored-by: Mohamed Zeidan <[email protected]>

* Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation

… with minor improvements and bug fixes. (#265) 1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check 2. HMA will be able to detect File system read only error 3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment.

* slurm-eks-helper-fix * Small fix to test to reflect new changes

* First draft integ tests * Mini fixes to ensure integ tests work * Allow integ tests to run from clean directory * Change torch job creation namespace to default

…h_create function, update unit test (#243)

* decouple template from src code * update unit tests for init * remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float * Update pyproject.toml for cluster stack template to include json, update read_only to be boolean * change type handler from class to module functions, change some public function to private, update unit tests * update create for pytorch job template, remove redundant integ test code for init

* return SDK class in pytorch model.py for v1_0 and v1_1, update pytorch_create function, update unit test * remove name and namespace from create for inference SDK to match with training SDK, functionality remains the same * fix unit test, add metadata class usage to example notebook, remove skip test * fix unit test again * update integ tests * update create call

* decouple template from src code * remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float * change type handler from class to module functions, change some public function to private, update unit tests * cluster-stack template agnostic change * update unit tests * update integ test * resolve circular import for cluster_stack * resolve rebase merge conflict * rename to_domain to to_config for cluster_stack * increase timeout for endpoint integ test from 15min to 20min

* decouple template from src code * remove field validator from SDK pydantic model, fix minor parsing problem with list, update kubernetes_version type from str to float * change type handler from class to module functions, change some public function to private, update unit tests * cluster-stack template agnostic change * update unit tests * update integ test * resolve circular import for cluster_stack * resolve rebase merge conflict * rename to_domain to to_config for cluster_stack * increase timeout for endpoint integ test from 15min to 20min * move jinja template to schema template * lazy loading in pytorch-job template to resolve import issue * tasks_per_node validation added, correct typo for task governance related parameter * get default namespace applied to inference for init experience, ignore pydantic warning, update logging experience * update integ test * fix integ test * Update default namespace logic, init_constants.py naming change * update unit test

Co-authored-by: Mohamed Zeidan <[email protected]>

* add telemetry to init experience, remove duplicate code in init_constants * add filter for deprecation warning, fix hyp --version * change default instance group name for instance group settings

…ce launch (#249)
* Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254)
* changelog version update (#256)
  Co-authored-by: Mohamed Zeidan <[email protected]>
* Fix README documentation and broken anchor links (#252)
  **Description**
  - Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks.
  - Fixed training SDK imports, observability utils import path, and cluster management workflow examples.
  **Testing Done**
  - Verified all anchor links work correctly in table of contents and usage sections
  - Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/
  - Validated CLI commands match implementation in hyp_cli.py
  - Confirmed code examples use correct class names and method signatures
* Small bug fix to print debug messages for inference logger (PySDK) (#246)
* Draft of inference logger bug fix
* Draft fix of inference logger for SDK
* Revert adding --debug flag
* Add debug parameter to failing unit tests
* Fix create_from_dict to not have hardcoded debug flag
* Add code-coverage workflow to GitHub workflows (#257)
* Add code coverage workflow
* Update artifact version to v4
* Fixed report upload
* Simplified workflow using tox.ini
* Make sure coverage is on right source files
* Bug fix for 0 percent code coverage error
* Bump version to 3.2.2 (#260)
* Bump version to 3.2.2
  **Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files.
  **Testing Done** Version bump only - no functional changes requiring additional testing.
* Changelog update for v3.2.2
  **Description** Added details for Health Monitoring Agent updates to changelog
  **Testing Done** Production canary failure fixes validated.
* Changelog update for v3.2.2
  **Description** Updated the release date to represent the correct date.
  **Testing Done** No breaking changes.
* Bump hyperpod-pytorch-job-template to v1.1.2
  **Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert.
  **Testing Done** Version bump and changelog update - node-count validation revert functionality verified.
* Update readme to include review guidelines (#261)
* Update PR template
* Update template
* Update template format
* Update format
* Fix readme
* Feature: Delete Cluster Command (#250)
* delete cluster stack
* delete cluster stack
* removed unnecessary file
* unit tests
* more modular code
* refactored modular code
* sdk code added and improved modularity
* cleanup
* removed silent failure for sdk
* fixed unit tests
* integ tests
* 2 integ happycase tests
* changed test to use iam role instead of s3 bucket
---------
Co-authored-by: Mohamed Zeidan <[email protected]>
* Code Coverage for Integ Tests (#262)
* Code Coverage for Integ Tests
* Making sure target of coverage is correct
* Removing duplicate implementation
* Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265)
  1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check
  2. HMA will be able to detect File system read only error
  3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment.
* Removing duplicate cluster-creating integ test (#266)
* Access entry fix (#267)
* Fix Slurm failures from missing orchestration key (#268)
* slurm-eks-helper-fix
* Small fix to test to reflect new changes
* small fix after resolving merge conflict
---------
Co-authored-by: Xichao Wang <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: papriwal <[email protected]>
Co-authored-by: aviruthen <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jiayelamazon <[email protected]>

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

Co-authored-by: Mohamed Zeidan <[email protected]>

* Added .venv to .gitignore
* Add venv/ to .gitignore
* Added .venv to .gitignore
* Default values for accelerators implemented
* Added memory validation and default accelerators values
* Setting default values for memory, vcpu, and accelerators
* Unit tests for new quota_allocation functions
* Refactoring default values
* Unit and integration tests complete
* Fix for default cpu values
* Refactoring and clean up
* Accounting for accelerators when min values provided
* Refactoring and clean up
* Increased default buffer for mem and cpu. Refactored _resolve_ functions
* Refactored and added more unit tests
* Additional function for default values created. Refactored some unit tests to account for new default values
* Refactoring and test additions
* Implemented regressive resource scaling for cpu and memory
* Refactoring of unit and integ tests
* Small change for a unit test
* Increased reserved resources amounts
* Small refactoring
---------
Co-authored-by: Sean Archer <[email protected]>

Co-authored-by: ytlee93 <[email protected]>

* Update jinja template handling logic for inference and training, cluster logic remaining for discussion * test inference and training all parameters * minor change to fix integ * move create_from_k8s_yaml to init_utils and init_constants, reuse create for create_from_dict for inference SDK, update template kind * Fix unit test * update create and create_from_dict for inference to share internal_create, revert unit test changes fix

* Update cluster creation template url with versioning * update tests * add cli parameter * Update tests * Fix unit test * update custom s3 name * update default_create * Update storage parameter * update defaults --------- Co-authored-by: pintaoz <[email protected]>

… with minor improvements and bug fixes. (#286) 1. Update the Health Monitoring Agent to be compatible with Nvidia MIG

Aditi2424 and others added 25 commits July 18, 2025 12:24

Update telemetry status to be Integer for parity (#130)

223af40

Co-authored-by: adishaa <[email protected]>

Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…

cf77296

… with minor improvements and bug fixes (#137)

Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…

0342f60

… with minor improvements and bug fixes. (#139)

update inference CLI describe command print for better visualization …

631ddf9

…and ux (#136)

Update inference integ test to add dependency to improve telemetry ex…

dc440c3

…ception count data (#140)

Manual release v3.0.1 (#143)

cc08405

* manual release v3.0.1

change security-monitoring metrics data destination to us-east-2 for …

079fafd

…alarm fix (#147)

feat: Add region detection to install Health Monitoring Agent and use…

29a16c5

… regionalized HMA URI (#141)

Add unique time string to integ test (#150)

66232ed

* Add unique time string to integ test * Update syntax

update example notebook for inference CLI (#151)

9fbec4a

Training: Main documentation update (#153)

8034a24

* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>

Update inference SDK examples (#155)

0bcee6d

* Update inference SDK examples * Update readme

update help text to avoid truncation (#158)

d2130e9

Add an option to disable the deployment of KubeFlow TrainingOperator (#…

293f9b9

…102)

Remove unused param from documentation (#170)

9f534b4

Update volume flag to support hostPath and pvc (#171)

ec8800d

* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed

Restructure list-cluster output (#173)

95e073e

Co-authored-by: pintaoz <[email protected]>

Update inference config and integ tests (#167)

a8a2baf

* Update inference config and integ tests * Update integ tests for new canaries

Update readme for volume flag (#176)

2908a62

Manual release v3.0.2 (#177)

9b7220c

* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <[email protected]>

Add schema pattern check to pytorch-job template (#178)

36fac66

* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally

Fix training test (#184)

dcbc8fb

* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries

Update logging information for submitting and deleting training job (#…

28424e4

…189) Co-authored-by: pintaoz <[email protected]>

rsareddy0329 requested a review from a team as a code owner August 5, 2025 23:05

rsareddy0329 and others added 4 commits August 6, 2025 13:51

Added new column 'deployment configs' to the itable that allows user'…

6553766

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <[email protected]>

Add instance type support for ml.p6e-gb200.36xlarge (#204)

63ff3b4

* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin

changed endpoint name from value user has to manually insert to place…

e3f697a

…holder value (#206) Co-authored-by: Mohamed Zeidan <[email protected]>

zhaoqizqwang and others added 30 commits September 10, 2025 15:09

Update readme to include review guidelines (#261)

458bd63

* Update PR template * Update template * Update template format * Update format * Fix readme

Code Coverage for Integ Tests (#262)

dffcc3d

* Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation

Removing duplicate cluster-creating integ test (#266)

5c42bcd

Access entry fix (#267)

0b1bc8f

Fix Slurm failures from missing orchestration key (#268)

da2df2f

* slurm-eks-helper-fix * Small fix to test to reflect new changes

Bump versions for release (#270)

d08aefb

Update CHANGELOG.md (#274)

7421a76

Integration tests for init experience (#242)

3ee6d51

* First draft integ tests * Mini fixes to ensure integ tests work * Allow integ tests to run from clean directory * Change torch job creation namespace to default

return SDK class in pytorch model.py for v1_0 and v1_1, update pytorc…

6ed3031

…h_create function, update unit test (#243)

delete cluster functionality (#247)

7c09e6a

Co-authored-by: Mohamed Zeidan <[email protected]>

Add telemetry and dog fooding fixes (#248)

4b1e0fb

* add telemetry to init experience, remove duplicate code in init_constants * add filter for deprecation warning, fix hyp --version * change default instance group name for instance group settings

add example notebooks for init experience, update README to match wit…

8bc72cf

…h new documentation (#250) * add example notebooks for init experience, update README to match with new documentation * clear output

Fix test_hp_endpoint create_from_dict test

a3e7efa

Fix tox.ini to fix coverage issue

75c601b

Fix tox.ini to fix coverage issue

315f7ec

added describe cluster cmd (#278)

9891db4

Co-authored-by: Mohamed Zeidan <[email protected]>

Update aws-efa-k8s-device-plugin version to 0.5.10 (#282)

dc2096a

Update README.md

7c185e3

Add ml.p5.4xlarge instance type support (#283)

c5edf2d

Co-authored-by: ytlee93 <[email protected]>

Release new version for Health Monitoring Agent (1.0.935.0_1.0.282.0)…

0ae955c

… with minor improvements and bug fixes. (#286) 1. Update the Health Monitoring Agent to be compatible with Nvidia MIG
